Open source enthusiasm

o Nouveau & mesa contributor
  o performance counters (most of the time) & small GL bug fixes
o Google Summer of Code student in 2013 & 2014
o XDC talk last year in Bordeaux, France

Real life job

o Got my master's degree last year
o HPC engineer at INRIA, Bordeaux
  o developing a source-to-source OpenMP compiler (Clang)

Introduction

Summary

o Introduction
  o What are performance counters?
  o NVIDIA's perf counters
  o NVIDIA's profiling tools

What are performance counters?

Performance counters

o are blocks in modern processors that monitor their activity
o count low-level hardware events such as cache hits/misses

Why use them?

o to analyze the bottlenecks of 3D and GPGPU applications
o to dynamically adjust the performance level of the GPU

How to use them?

o GUIs like NVIDIA Nsight and Linux Graphics Debugger
o APIs like NVIDIA CUPTI and PerfKit
o OpenGL extensions like GL_AMD_performance_monitor

NVIDIA's perf counters

Two groups of counters exposed

o compute counters for GPGPU applications
  o ex: warps_launched, divergent_branch ...
o graphics counters for 3D applications
  o ex: shader_busy, texture_busy ...

Different types of counters

o global counters
  o collect activities regardless of the context
o local counters
  o collect activities per-context only

NVIDIA's profiling tools

Visual Profiler

o cross-platform performance profiling tool for CUDA apps
o based on the CUPTI API (exposes compute-related counters)

Nsight

o Visual Studio plugin for profiling GL/D3D apps (Windows)
o based on the PerfKit API (exposes graphics-related counters)

Linux Graphics Debugger

o performance profiling tool for GL apps (SIGGRAPH'15)
o exposes graphics-related counters on Linux (yeah!)
o unfortunately, no API like PerfKit is provided

Case study

Summary

o Case study
  o Improve a GL app with NVIDIA's tools
  o What about Nouveau?

Improve a GL app

How to improve the performance of a GL app using perf counters?
Let's try NVIDIA Linux Graphics Debugger!

Figure: A brain rendered in OpenGL with 165786 voxels

Perf counters            Values
FPS                      [not legible]
geom_busy                1%
shader_busy              0.2%
texture_busy             0.5%
ia_requests              350000
l2_read_sysmem_sectors   200000

mmh...

l2_read_sysmem_sectors seems to be very high and this is probably
one of the bottlenecks!

Problem

o too many memory reads from the system memory
  o due to the GPU fetching the vertices at every frame

Solution

o use a VBO to store the vertices on the GPU

Perf counters            Without VBO     With VBO
FPS                      [not legible]   [not legible]
geom_busy                1%              1%
shader_busy              0.2%            0.2%
texture_busy             0.5%            0.5%
ia_requests              350000          250000
l2_read_sysmem_sectors   200000          35

Note: there are probably other bottlenecks, but this is just a basic example.

What about Nouveau?

No tools like Linux Graphics Debugger!

… but things are going to change!

Perf counters project

o started during GSoC'13
o not a trivial project and a ton of work
  o reverse engineering (long and hard process)
  o kernel and userspace support (including APIs & tools)

Goals & Benefits

o expose perf counters in a useful and decent manner
o help developers find bottlenecks in their 3D applications

Reverse engineering

Summary

o Reverse engineering
  o Compute-related counters
  o Graphics-related counters
  o Current status

Compute-related counters

Requirements

o CUDA and CUPTI API (CUDA Profiling Tools Interface)
o valgrind-mmt and demmt (envytools)
o cupti_trace from the envytools repository
  o tool which helped me a lot in the REing process

How does it work?

o launch cupti_trace (i.e. cupti_trace -a NVXX)
  o it will automatically trace each hardware event exposed
o grab a cup of coffee :) and wait a few minutes
o traces are now saved to your disk
o analyze and document them

Graphics-related counters

Reverse engineering PerfKit on Windows

o really painful and very long process! :(
o no MMIO traces and no valgrind-mmt
  o need to do it by hand (dump registers, etc.)
  o very hard to find multiplexers

Reverse engineering LGD on Linux

o Linux Graphics Debugger saved my brain! :)
o almost the same process as for compute-related counters
  o but not automated, because it's a GUI
o really easy to find multiplexers this time

Current status

o DONE means it's fully reversed and documented
o MOSTLY means that some perf counters are reversed
o WIP means that I started the reverse engineering process
o TODO means that it's on my (long) todolist

Perf counters   Tesla   Fermi   Kepler   Maxwell
Graphics        [status cells not legible]
Compute         [status cells not legible]

1. Except per-context counters (requires PerfKit).
2. Need to RE new counting modes.
3. Only on GM107 and need to RE per-context counters logic.

Nouveau & mesa

Summary

o Nouveau & mesa
  o Kernel interface
  o Synchronization

Kernel interface

Why is a kernel interface needed?

o because global counters have to be programmed via MMIO
  o only root or the kernel can write to them

What does the interface have to do?

o set up the configuration of counters
o poll counters
o expose counters' data to userspace (readout)

Synchronization

Synchronizing operations

o CPU: ioctls
o GPU: software methods

Software method

o command added to the command stream of the GPU context
o upon reaching the command, the GPU is paused
o the CPU gets an IRQ and handles the command
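
The three software-method steps above can be pictured with a small toy model. This is only an illustrative sketch and not Nouveau code: the `ToyGpu` class, the `SW_METHOD:` prefix and the `perfmon_start`/`perfmon_stop` names are all hypothetical, chosen to show how a command placed in the stream lets the CPU act at an exact point between two draws.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGpu:
    """Toy model of a GPU context consuming a command stream (hypothetical)."""
    log: list = field(default_factory=list)
    paused: bool = False

    def run(self, stream, irq_handler):
        for cmd in stream:
            if cmd.startswith("SW_METHOD:"):
                # The GPU does not execute a software method itself:
                # it pauses and raises an interrupt for the CPU.
                self.paused = True
                self.log.append(f"gpu paused at {cmd}")
                irq_handler(self, cmd.split(":", 1)[1])
                # Once the CPU has handled the command, execution resumes.
                self.paused = False
                self.log.append("gpu resumed")
            else:
                self.log.append(f"gpu executed {cmd}")
        return self.log


def cpu_irq_handler(gpu, method):
    """Hypothetical CPU-side handler, called while the GPU is paused."""
    assert gpu.paused
    gpu.log.append(f"cpu handled {method}")


stream = ["DRAW_0", "SW_METHOD:perfmon_start", "DRAW_1", "SW_METHOD:perfmon_stop"]
log = ToyGpu().run(stream, cpu_irq_handler)
for entry in log:
    print(entry)
```

Because the method travels in the same stream as the draws, the handler runs after `DRAW_0` has been consumed and before `DRAW_1` starts, which is the ordering guarantee an ioctl issued from the CPU side alone does not give.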

Nouveau

Perfmon work

o exposes low-level configuration of perf counters
o includes a lot of signals/sources for Tesla, Fermi and Kepler
o allows scheduling/monitoring perf counters from userspace
o based on nvif (ioctls interface)
o no Perf support is planned for now!

Mesa

o patch series already submitted to mesa-dev (pending)
  o pending because this requires a libdrm release with nvif support
  o will expose around 30 global perf counters
  o will enable GL_AMD_performance_monitor
o patches still in my local tree but almost ready
  o will expose around 80 global perf counters for Fermi/Kepler

APIs & Tools

Summary

o APIs & Tools
  o GL_AMD_performance_monitor
  o Nouveau PerfKit
  o Apitrace

GL_AMD_performance_monitor

OpenGL extension

o based on the pipe_query interface
o drivers need to expose a group of GPU counters to enable it
o released in mesa 10.6
o exposes per-context counters on Fermi/Kepler
  o this requires compute support to launch kernels
o used by Apitrace for profiling frames (GSoC'15)
o does not support round robin sampling and multi-pass events
o does not fit well with NVIDIA hardware (obviously)

Nouveau PerfKit

Linux version of NVIDIA PerfKit

o built on top of mesa (as a Gallium state tracker like VDPAU)
o needed to reverse engineer the API (return codes, etc.)
  o around 100 unit/functional tests have been written
o implemented libperfkit with both Windows and Linux support
o allows support of round robin sampling and multi-pass events
o RFC submitted in June (around 1700 LOC, still in review)
o will expose more perf counters than GL_AMD_performance_monitor
o no users for now but Apitrace could use PerfKit
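
The slides do not define round robin sampling or multi-pass events, so the sketch below only illustrates how these terms are commonly understood in GPU profiling: when more events are requested than the hardware can count at once, they are split into groups that are measured either in separate replays of the workload (multi-pass) or in rotation over successive sampling intervals (round robin). The function names and the limit of two counter slots are assumptions made for the example; this is not the PerfKit API.

```python
def plan_passes(events, slots):
    """Split the requested events into groups of at most `slots` events.

    Each group is what could be programmed at the same time, assuming the
    hardware has `slots` free counter slots (an illustrative assumption).
    """
    if slots < 1:
        raise ValueError("need at least one hardware counter slot")
    return [events[i:i + slots] for i in range(0, len(events), slots)]


def round_robin(passes, n_samples):
    """Return the group of events programmed at each sampling interval."""
    return [passes[i % len(passes)] for i in range(n_samples)]


events = [
    "shader_busy",
    "texture_busy",
    "geom_busy",
    "ia_requests",
    "l2_read_sysmem_sectors",
]

# Multi-pass: replay the workload once per group.
passes = plan_passes(events, slots=2)
for number, group in enumerate(passes):
    print(f"pass {number}: {group}")

# Round robin: rotate the groups over four sampling intervals in one run.
schedule = round_robin(passes, n_samples=4)
for interval, group in enumerate(schedule):
    print(f"interval {interval}: {group}")
```

Multi-pass gives exact counts at the cost of replaying the workload, while round robin keeps a single run but only samples each group part of the time.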

Apitrace

GSoC'15 project

o add support for performance counters in the profiling view
o project by Alex Tru (mentored by Martin Peres)

DONE (but still not upstream)

o abstraction system for profiling in glretrace
o support for GL_AMD_perfmon and Intel_perfquery
  o allows querying and monitoring metrics
o profiling view improvements for qapitrace
  o some minor parts are done but very basic visualization

Let's go back to the case study but now with...

… Apitrace and Nouveau!

Apitrace/Nouveau

How to list available metrics?

o glretrace --list-metrics <trace>

Backend GL_AMD_performance_monitor:
 Group #0: Global performance counters.
  Metric #0: shader_busy (type: CNT_TYPE_GENERIC, type: CNT_NUM_UINT64).
  Metric #1: ia_requests (type: CNT_TYPE_GENERIC, type: CNT_NUM_UINT64).
  Metric #2: texture_busy (type: CNT_TYPE_GENERIC, type: CNT_NUM_UINT64).
 Group #1: MP counters.
  Metric #0: active_cycles (type: CNT_TYPE_GENERIC, type: CNT_NUM_UINT64).
  Metric #1: active_warps (type: CNT_TYPE_GENERIC, type: CNT_NUM_UINT64).
Backend opengl:
 Group #0: CPU.
  Metric #0: CPU Start (type: CNT_TYPE_TIMESTAMP, type: CNT_NUM_INT64).
  Metric #1: CPU Duration (type: CNT_TYPE_DURATION, type: CNT_NUM_INT64).

Figure: List of available metrics in Apitrace. The dialog shows the
GL_AMD_performance_monitor backend, group "Global performance counters",
with the following metrics visible:

setup_primitive_count, ia_requests,
l1_local_load_hit, l1_local_load_miss,
l1_local_store_hit, l1_local_store_miss,
l1_global_load_hit, l1_global_load_miss,
uncached_global_load_transaction, global_store_transaction,
l1_shared_bank_conflict, l1_shared_load_transactions,
l1_shared_store_transactions,
elapsed_cycles, shaded_pixel_count, shader_busy,
shd_l1_requests, shd_tex_requests,
sm_active_cycles, sm_active_warps, sm_branches, sm_divergent_branches,
sm_cta_launched, sm_inst_executed, sm_inst_executed_atom,
sm_inst_executed_atom_cas, sm_inst_executed_cs,
sm_inst_executed_global_loads, sm_inst_executed_global_stores,
sm_inst_executed_gs, sm_inst_executed_local_loads,
sm_inst_executed_local_stores, sm_inst_executed_ps

How to profile a GL app?

o glretrace --pframes="GL_AMD_perfmon: [0,65]" <trace>

# ia_requests
frame 285734 
frame 285799 
frame 285793 
frame 285763 
frame 285762 
frame 285809 
frame 285800 
frame 285744 
frame 285743 
frame 285796 
frame 285893 
frame 285818 
frame 285754 
frame 285804 
frame 285762 
frame 285763 
frame 285813 
frame 285804 
frame 285815 
frame 285747 
frame 285754 


Rendered 20 frames in 0.3365 secs, average of 59.4344 fps 
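
The per-frame values printed above are easier to compare once they are summarized. The sketch below parses text in the layout shown on this slide (one `frame <value>` line per frame) and computes a few statistics. It assumes that layout exactly as it appears here; `parse_frame_metric` is a hypothetical helper, not part of Apitrace, and the sample only reuses the first three values from the slide.

```python
import statistics

# First lines of the output shown on the slide, used as sample input.
SAMPLE_OUTPUT = """\
# ia_requests
frame 285734
frame 285799
frame 285793
Rendered 20 frames in 0.3365 secs, average of 59.4344 fps
"""


def parse_frame_metric(text):
    """Collect the integer value of every 'frame <value>' line."""
    values = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the '# metric' header and the final 'Rendered ...' summary.
        if len(parts) == 2 and parts[0] == "frame" and parts[1].isdigit():
            values.append(int(parts[1]))
    return values


values = parse_frame_metric(SAMPLE_OUTPUT)
summary = {
    "frames": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
}
print(summary)
```

In practice the text would come from the captured output of the glretrace command instead of the embedded sample.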

Figure: Very basic visualization with histograms in Apitrace (profiling
view, "Frames" tab, metric ia_requests)

id   ia_requests
0    285742
1    285688
2    285797
3    285790
4    285767
5    285815
6    285759
7    285742
8    285737
9    285776

Apitrace/Nouveau

Perf counters            Without VBO     With VBO
geom_busy                7%              17%
shader_busy              0.5%            1%
texture_busy             2%              4%
ia_requests              371000          286000
l2_read_sysmem_sectors   [not legible]   35
FPS                      251             1601

Without reclocking
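
To compare the effect of the VBO across the two drivers, the ia_requests values from both tables can be turned into relative changes. The snippet below is a small worked calculation using only the figures shown in the slides (350000 to 250000 with NVIDIA's tools, 371000 to 286000 with Apitrace on Nouveau); `relative_change` is a helper introduced here for illustration.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# ia_requests without and with the VBO, as reported in the two tables.
ia_requests = {
    "NVIDIA (Linux Graphics Debugger)": (350000, 250000),
    "Nouveau (Apitrace)": (371000, 286000),
}

changes = {
    name: relative_change(before, after)
    for name, (before, after) in ia_requests.items()
}

for name, pct in changes.items():
    print(f"{name}: ia_requests changed by {pct:.1f}%")
```

Both stacks report a drop of the same order once the vertices live on the GPU, which suggests the counter exposed through Nouveau tracks the same event as NVIDIA's tool.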

Conclusion

Summary

o Conclusion
  o Current status
  o Future work

Current status

Reverse engineering

o almost all perf counters on Tesla, Fermi and Kepler reversed

Nouveau DRM & mesa

o perfmon work merged in Linux 4.3
o GL_AMD_performance_monitor merged in mesa 10.6

Userspace tools

o GL_AMD_perfmon used by Apitrace!
o perf counters are going to be exposed in a useful manner. :)

Future work

Short-term period

o add more signals & sources for Fermi and Kepler
o rework the software methods interface
o release libdrm with nvif support (Ben Skeggs)
o complete the support of perf counters in mesa
  o this will expose GL_AMD_perfmon on Tesla
  o this will expose a lot of perf counters on Tesla, Fermi and Kepler

Long-term period

o finish implementation of Nouveau PerfKit
  o and make something use it (Apitrace?)
o reverse engineer Maxwell performance counters

Thanks!

I would like to thank the X.Org board members for my travel
sponsorship!

Feel free to ask questions...